24-Hour News Cycles Are Exhausting, so Let's Automate


What is Web Scraping?



Web scraping is a great way to automate data collection in R. Luckily, it's also pretty easy to scrape information from HTML pages using the code below. Let's get started.



Step 1: Load the packages!
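The packages below are the ones the later steps assume; this is a suggested set, so swap in your own preferences if you have them.

```r
library(rvest)       # read_html(), html_elements(), html_text2()
library(compiler)    # cmpfun() for byte-compiling the scrapers (Step 3)
library(DT)          # datatable() widget for browsing the results (Step 7)
library(htmlwidgets) # saveWidget() if you want a standalone html file
```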


Step 2: Write the functions for scraping from our selected news sources

We're using the CSS selectors corresponding to the bodies of text we want to scrape, and the function's input is the URL of the specific page we want to scrape from.

As I only wrote this to read some headlines, I've limited the length of each object to 10 so that the dataframe won't throw any errors if the lengths of the objects differ. Feel free to change this.
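A minimal sketch of such a function follows. The function name and the `.headline` selector are placeholders; inspect the page you're targeting (e.g. with your browser's dev tools or the SelectorGadget extension) and substitute the real selector.

```r
# Hypothetical scraper: pull headline text from a news page.
# `selector` is a placeholder CSS selector -- replace it with the one
# that matches the elements on the page you actually want.
getHeadlines <- function(url, selector = ".headline") {
  page <- read_html(url)
  headlines <- page %>%
    html_elements(selector) %>%
    html_text2()
  # Keep only the first 10 so every source yields the same length
  # and the final dataframe binds cleanly.
  head(headlines, 10)
}
```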


Step 3: Let's compile the functions we wrote above to make them extra speedy
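Byte-compiling is a one-liner per function with `compiler::cmpfun()`. Note that since R 3.4 functions are JIT-compiled by default, so the gain here is modest; the sketch below assumes a scraper named `getHeadlines()` (a hypothetical name) was defined in Step 2.

```r
# Byte-compile the scraper for a modest speed-up on repeated calls.
getHeadlines <- cmpfun(getHeadlines)
```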


Step 4: Create the data objects containing the bodies of text that we scraped

If you want to write something similar, you can copy this code and change the URLs and object names; it will work as long as the CSS selectors we used in Step 2 are correct.
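Something along these lines, assuming a scraper like the hypothetical `getHeadlines()` from Step 2. The URLs are real front pages, but the selectors are placeholders you'd need to verify against the live markup, which changes often.

```r
# One object per news source. The selectors here are illustrative
# placeholders -- check them against the current page markup.
bbcHeadlines     <- getHeadlines("https://www.bbc.com/news", ".headline")
reutersHeadlines <- getHeadlines("https://www.reuters.com/", ".story-title")
```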


Step 5: Use the data objects above to initialize the final dataframe
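Because each vector was capped at 10 items in Step 2, the columns line up row for row. A sketch, using the hypothetical object names from Step 4:

```r
# One column per source; equal lengths mean data.frame() binds cleanly.
news <- data.frame(
  BBC     = bbcHeadlines,
  Reuters = reutersHeadlines,
  stringsAsFactors = FALSE
)
```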


Step 6: Finalize the dataframe
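One way to finalize it is a light cleanup pass; this is just a suggestion, assuming the `news` dataframe from Step 5:

```r
# Trim stray whitespace from every column and set readable names.
news[] <- lapply(news, trimws)
names(news) <- c("BBC", "Reuters")
```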


Step 7: Style the widget we'll use to actually read the scraped data

You can also make this an R script, then add saveWidget(tweets, "nameOfThisWidget.html") as the last line, so that after the code executes, the widget will be saved by itself in an HTML file.

This has the added benefit of letting us download our data as a CSV file. The widget also has search functionality, which is a nice touch if you're planning on scraping large volumes of data.
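A DT widget with both of those features can be sketched like this, assuming the `news` dataframe from the earlier steps (the `tweets` object name matches the saveWidget() call above):

```r
# Build the widget: the Buttons extension adds the csv download,
# and the default "f" in dom enables the search box.
tweets <- datatable(
  news,
  extensions = "Buttons",
  options = list(
    dom = "Bfrtip",       # B = buttons, f = search, r/t/i/p = table chrome
    buttons = c("csv"),   # the csv download button
    pageLength = 10
  )
)
tweets

# Optionally save it as a standalone html file:
# saveWidget(tweets, "nameOfThisWidget.html")
```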